Return-Path: <connolly@pixel.convex.com>
Received: from dxmint.cern.ch by nxoc01.cern.ch
    (NeXT-1.0 (From Sendmail 5.52)/NeXT-2.0) id AA22226;
    Thu, 21 Jan 93 23:26:07 MET
Received: by dxmint.cern.ch (5.65/DEC-Ultrix/4.3) id AA14802;
    Thu, 21 Jan 1993 23:41:33 +0100
Received: from pixel.convex.com by convex.convex.com (5.64/1.35)
    id AA03107; Thu, 21 Jan 93 16:41:43 -0600
Received: from localhost by pixel.convex.com (5.64/1.28)
    id AA24439; Thu, 21 Jan 93 16:40:55 -0600
Message-Id: <9301212240.AA24439@pixel.convex.com>
To: www-talk@nxoc01.cern.ch
Subject: thoughts on the future of HTML [long]
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="cut-here"
Date: Thu, 21 Jan 93 16:40:54 CST
From: Dan Connolly <connolly@pixel.convex.com>

--cut-here

HTML was designed to be simple. Folks are supposed to be able to whack
out HTML with a text editor -- no rocket science required. Also, you
ought to be able to use MS/Word or the equivalent to write your
documents or to view HTML documents.

But it's also designed to be processed by machine -- lots of machines
all over the planet.

Enter SGML. It seemed like the natural choice, so Tim implemented an
informal SGML parser in his WWW clients. Nobody really knew the ins and
outs of SGML, so information providers who wanted to produce HTML
automatically just checked to be sure the public www client grokked.

Then other folks tried to write HTML parsers. We discovered that there
were a lot of issues that were not covered by any spec other than the
WWW source code.

Then I tried to use the sgmls parser to develop an HTML to FrameMaker
tool. I discovered that the WWW source code conflicted with the SGML
standard. Uh oh!

By now I think we all agree that we should actually use SGML to specify
the syntax and structure of HTML.

But I wonder: on whom rests the responsibility for validating HTML
documents? This is really an HTTP issue: is it part of the protocol
that the data stream is _valid_ HTML?
Or is it the client's responsibility to deal with errors?

I suggest that it should be the responsibility of the _server_ to
produce valid HTML. Of course the client should be robust in the face
of errors. But I suggest that when a client and a server differ on
their interpretation of a document, the client is at fault if the
document is valid, and the server is at fault if the document is not.

It's too late to introduce this scenario into HTTP v0.9. But future
servers should have the burden of producing valid documents. This will
add complexity to the server code: it can no longer just grab the
contents of any old .html file and ship it out the port. But it could,
for example, fix the markup errors on the fly and write error messages
to a log file.

If a server knows the structure of the document it sends, it should be
able to send the document using SGML, ASN.1, MIME, or whatever
transport mechanism we choose. This is the real value of standardizing
on SGML: the syntax is one thing, but we don't even have to use it! We
have a DTD that tells, in a more abstract way, what the content of the
document is.

With that in mind, I suggest we make HTML2 more prescriptive than HTML.
It should match the way documents are structured and processed more
than the way they are typed in a text editor. For example, the
following document is legal, but it's a pain to process:

--cut-here
Content-Type: text/x-html

<html>
Here's the first paragraph. It's at the outer structural level.<P>
This is the <em>Second</em> paragraph.
<body>
Here's another paragraph.
<H1>Another one</h1>
The last paragraph.
</body>
</html>

--cut-here

Imagine you want to parse that document and answer queries like: "show
me the second paragraph of the document." HTML isn't supposed to be too
sophisticated, but it _is_ supposed to model typical word-processor
documents fairly well. A paragraph is a pretty fundamental chunk of
information. The definition of a paragraph in HTML is much more complex
than it need be.

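To make the difficulty concrete, here is a minimal sketch of what
answering that query takes against the first document above. It is
written in Python with the standard html.parser module; the class name
LooseParagraphs and the BREAKS set are illustrative assumptions, not
part of any WWW client or of the HTML spec. Because <P> is only a
separator, the formatter has to infer paragraph boundaries from every
tag that might end a run of text:

```python
from html.parser import HTMLParser

# Assumption for this sketch: these tags implicitly end the current
# run of text. A real formatter would need the full list from the DTD.
BREAKS = {"html", "body", "p", "h1"}


class LooseParagraphs(HTMLParser):
    """Recover paragraphs from HTML where <P> is only a separator."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.headings = []
        self._buf = []
        self._in_heading = False

    def _flush(self):
        # Collapse whitespace and file the text as heading or paragraph.
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if text:
            target = self.headings if self._in_heading else self.paragraphs
            target.append(text)

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag in BREAKS:
            self._flush()
            self._in_heading = (tag == "h1")

    def handle_endtag(self, tag):
        if tag in BREAKS:
            self._flush()
            self._in_heading = False

    def handle_data(self, data):
        # Inline markup such as <em> is ignored; only its text is kept.
        self._buf.append(data)


DOC = """<html>
Here's the first paragraph. It's at the outer structural level.<P>
This is the <em>Second</em> paragraph.
<body>
Here's another paragraph.
<H1>Another one</h1>
The last paragraph.
</body>
</html>"""

parser = LooseParagraphs()
parser.feed(DOC)
parser.close()
print(parser.paragraphs[1])  # the "second paragraph" query
```

Note that the boundary rules live entirely in the formatter: nothing in
the document itself says where a paragraph starts or stops.
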
Consider the following representation of essentially the same document:

--cut-here
Content-Type: text/x-html

<html>
<body>
<A>Here's the first paragraph.</A>
<a>This is the <em>Second</em> paragraph.</a>
<a>Here's another paragraph.</a>
<H1>Another one</h1>
<a>The last paragraph.</a>
</body>
</html>

--cut-here

The elements that make up the content of the BODY are all paragraphs.
Wouldn't it be a lot easier to write a formatter for the latter type of
document?

The original HTML design was motivated by conventional use of SGML,
with shortrefs and other markup minimization features to aid
keyboarding of documents. But Tim (wisely) didn't want to put those
features in his parser, so we ended up with a compromise: it's fairly
easy to keyboard, but it has virtually no structure.

So... I suggest that future versions of HTML should have more
structure. How much structure? Enough. Enough to model whatever kinds
of document make a WWW node. About as much as a TeXinfo node, which is
pretty similar to a FrameMaker TextFlow, or an MS/Word section. We
should probably also model typical markup conventions of internet mail
and USENET news.

Then... the big step: HyTime. I _really_ think we should look at HyTime
architectural forms to model things like threads, webs, hierarchies of
documents, etc. I think we could use HyTime mechanisms to form an
abstraction that models the structure of unix filesystems, message
threads, and other typical hypertext organizations on the internet.

This is how we should model "relative links." The unix ../../foo syntax
is fine as a model. But we should abstract the features from that
syntax, so that we can use the same model on VMS systems without
ad-hockery. That syntax happens to work for most gopher holes too, but
that's cheating: the gopher "path" string is supposed to be opaque. The
URL spec says something about how servers that use hierarchical
databases should use the unix path syntax. [What a load of hooey! Oh...
Ahem... sorry.]
And the syntax of most WAIS URLs has _two_ unix paths in it. What do
you do with that?

I really think HyTime links and locs are a good way to model all this.
The connection between HyTime and SGML is incidental. The only reason
to use SGML markup is to _interchange_ information between HyTime
applications (...or to talk about HyTime constructs in email, or any of
the other things that text is convenient for.)

Standardizing HTML was one thing: it's only used in the WWW community.
But standardizing the WWW addressing architecture is a much larger
venture. I hope eventually the various IETF groups etc. will realize
that the HyTime community has thought a lot about formal mechanisms to
name and reference information, and the product of their labors,
HyTime, may have some technical merit as well as the weight [and
overhead...] of an international standard.

Dan

p.s. I thought I subscribed to the cni-arch mailing list where URL
stuff was supposed to happen. I don't remember getting anything from
that list for a long time. Is there any URL discussion going on?

--cut-here--